KMID : 1022420200120030055
Phonetics and Speech Sciences
2020 Volume.12 No. 3 p.55 ~ p.63
Voice-to-voice conversion using transformer network
Kim June-Woo

Jung Ho-Young
Abstract
Voice conversion can be applied to various voice processing applications. It can also play an important role in data augmentation for speech recognition. Conventional methods build voice conversion on a speech synthesis architecture, with the Mel filter bank as the main parameter. The Mel filter bank is well suited to fast neural network computation, but it cannot be converted into a high-quality waveform without the aid of a vocoder. Further, it is not effective for obtaining data for speech recognition. In this paper, we focus on performing voice-to-voice conversion using only the raw spectrum. We propose a deep learning model based on the transformer network, which quickly learns the voice conversion properties using an attention mechanism between source and target spectral components. The experiments were performed on TIDIGITS data, a corpus of spoken English digit sequences. The converted voices were evaluated for naturalness and similarity using the mean opinion score (MOS) obtained from 30 participants. Our final results yielded 3.52±0.22 for naturalness and 3.89±0.19 for similarity.
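The attention between source and target spectral components mentioned in the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention. This is not the authors' model: it omits the learned query, key, and value projections, the multi-head structure, and all training. The function name `cross_attention`, the frame counts, and the 257 frequency bins are assumptions chosen only for illustration.

```python
import numpy as np


def cross_attention(target_frames, source_frames):
    """Scaled dot-product attention where each target frame (query)
    attends over all source frames (keys and values).

    Both inputs are (num_frames, num_bins) arrays of spectral frames.
    Hypothetical helper: a real transformer adds learned projections.
    """
    num_bins = source_frames.shape[-1]
    # Similarity of every target frame to every source frame.
    scores = target_frames @ source_frames.T / np.sqrt(num_bins)
    # Numerically stable softmax over the source-frame axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output frame is a weighted mix of source frames.
    return weights @ source_frames, weights


# Random stand-ins for magnitude-spectrum frames (assumed shapes).
rng = np.random.default_rng(0)
source = rng.standard_normal((40, 257))  # 40 source frames
target = rng.standard_normal((35, 257))  # 35 target frames
context, weights = cross_attention(target, source)
```

Each row of `weights` sums to 1 and shows how strongly one target frame draws on each source frame, which is the kind of alignment an attention-based conversion model can learn.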
KEYWORD
voice conversion, transformer network, signal-to-signal conversion
Listed journal information
Korea Research Foundation (KCI)